# A Fine-Grain Power-Gated FPGA with an Area Efficient High Speed Time Multiplexed Level Encoded Dual Rail Architecture

Marie Kottayil James, Binu K. Mathew

Abstract— The central challenge in the IC manufacturing industry is delivering high performance with reduced power, area and cost. Asynchronous FPGAs have the advantages of lower power consumption, lower electromagnetic interference, and better modularity in large systems. This paper focuses on reducing the dynamic power consumption of FPGAs, for which two techniques are employed. Fine-grain power gating methods are used to decrease the power consumption. The implemented architecture achieves a power-gating granularity as fine as a single two-input, one-output lookup table. The proposed architecture directly detects the activity of each lookup table by exploiting the advantageous features of asynchronous architectures. In addition, the arrival of data is detected in advance, which avoids both the extra delay of waking up a logic block and the power consumed by unnecessary switching. A comparison between the existing and proposed dual-rail encoding architectures shows that the newly implemented logic block with power gating and TM-LEDR encoding occupies less area and operates faster than the existing conventional ones.

**Index Terms**— Asynchronous architecture, Fine-grain power gating, TM-LEDR encoding, dynamic power, FPGA, lookup table, LEDR encoding.

## **1** INTRODUCTION

The key challenge in the IC scaling era is to deliver high-performance solutions while minimizing power and cost. Programmable logic devices such as field-programmable gate arrays (FPGAs) meet this challenge by providing a cost-efficient solution for low-volume to mid-volume applications, owing to their low non-recurring engineering costs. Also, with in-field programmability, FPGAs provide a platform solution with faster time to market and longer product lifetime. Despite FPGAs' computational energy efficiency advantage over digital signal processors (DSPs) today, DSPs are widely used in battery-operated applications, primarily due to their extensive power management capabilities that enable very low power consumption during standby. In contrast, existing FPGAs, designed for high-throughput, high-duty-cycle applications, have very few power management features. A Manchester carry chain circuit implemented in CMOS PTL logic, which has been used to implement arithmetic functions, is shown in Fig. 1. The function of the Manchester carry chain circuit is $C_k = G_k + C_{k-1} \cdot P_k$ for $k = 1$ to $n$, where $n$ is the bit number and $G_k$ and $P_k$ are the generate and propagate signals produced from the two inputs of the half adder.

On the other hand, FPGAs are cost-effective and flexible, since they consist of programmable logic blocks and programmable switch blocks that allow design modifications after production. The major disadvantage of FPGAs is their low performance, for the following reasons.

- 1. The area and delay of a switch block become large since a switch block consists of many programmable switches.
- 2. The time for data transfer between logic blocks becomes large since data from one logic block usually traverse through many switch blocks to reach the other logic block.

This paper presents a low-power FPGA that uses an LUT-level power gating technique called autonomous fine-grain power gating. To reduce the dynamic power consumption, we introduce a Level Encoded Dual Rail (LEDR) based architecture. A Time Multiplexed Level Encoded Dual Rail (TM-LEDR) based architecture is a further encoding technique, proposed with a view to reducing the area as well as the propagation delay of the entire circuit. The overall circuit occupies less area than the currently existing methods; hence these methods are integrated to reduce the power consumption of our FPGA.

## 2 RELATED WORK


## 2.1 Asynchronous FPGAs

An asynchronous, or self-timed, circuit is a sequential digital logic circuit that is not governed by a clock circuit or global clock signal. It differs from a synchronous circuit, in which changes to the signal values are triggered by repetitive pulses called a clock signal. Most digital devices today use synchronous circuits. However, asynchronous circuits have the potential to be faster, and also offer lower power consumption, lower electromagnetic interference, and better modularity in large systems. Asynchronous encoding schemes are classified into 1) bundled-data encoding and 2) delay-insensitive encoding (usually dual-rail encoding).



Figure 1 shows the overall architecture for the bundled-data encoding [5]. The bundled-data encoding is the most frequently used method in ASICs since its hardware overhead is relatively small. This is because the REQ wire is shared among all the *N* value wires. Hence, to transfer an *N*-bit value, only *N* + 2 wires are required.



In reconfigurable VLSIs such as FPGAs, delay-insensitive encoding is the ideal choice [6][7]. The most common delay-insensitive encoding is dual-rail encoding. Figure 2 shows the overall architecture for dual-rail encoding. Each bit is carried on two wires, so to transfer an *N*-bit value, 2*N* + 1 wires are required. There are two major methods for dual-rail encoding:

- 1. Four phase dual-rail encoding
- 2. Level encoded dual-rail encoding (LEDR)

## 2.2 Four Phase Dual Rail Encoding

Four-phase dual-rail encoding is the type of dual-rail encoding most commonly used by asynchronous FPGAs, because of its relatively small hardware cost. Figure 3 shows an example where data values 0, 0 and 1 are transferred. The sender sends a spacer (0, 0) after each data value. The receiver knows the arrival of a data value by detecting the change of either bit from 0 to 1. The drawback of four-phase dual-rail encoding is its low throughput, caused by the insertion of spacers.
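A minimal behavioral sketch of this protocol is given below. The spacer (0, 0) comes from the text; the choice of (1, 0) for a data 1 and (0, 1) for a data 0 is an assumed code-word assignment, and the function names are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[int, int]

SPACER: Pair = (0, 0)
# Assumed code words: (1, 0) encodes a 1 and (0, 1) encodes a 0.
CODE = {0: (0, 1), 1: (1, 0)}


def four_phase_encode(bits: List[int]) -> List[Pair]:
    """Return the rail states on the line, with a spacer after every data value."""
    states: List[Pair] = []
    for bit in bits:
        states.append(CODE[bit])
        states.append(SPACER)
    return states


def four_phase_decode(states: List[Pair]) -> List[int]:
    """A value arrives when the line leaves the spacer, i.e. one rail rises 0 to 1."""
    decode = {pair: bit for bit, pair in CODE.items()}
    bits: List[int] = []
    previous = SPACER
    for state in states:
        if previous == SPACER and state != SPACER:
            bits.append(decode[state])
        previous = state
    return bits


if __name__ == "__main__":
    line = four_phase_encode([0, 0, 1])
    print(line)
    print(four_phase_decode(line))
```

Transferring three data values occupies six line states in this model, which illustrates the throughput penalty of the spacers.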



## 2.3 Level Encoded Dual Rail Encoding

In LEDR encoding, no spacer is required, which enhances the throughput of delay-insensitive encoding. Figure 4 shows an example where data values "0", "0" and "1" are transferred. The sender sends data values alternately in phase 0 and phase 1 [8]. The receiver knows the arrival of a data value by detecting the change of phase. The drawback of the LEDR encoding is that it requires slightly more complex hardware.
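The alternation of phases can be sketched as follows. The model assumes the usual LEDR convention in which a line is a (V, R) pair, V carries the data value, and the phase is V XOR R; the function names and the starting phase are illustrative assumptions.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (V, R): value rail and second rail


def ledr_phase(pair: Pair) -> int:
    """Assumed convention: phase = V XOR R."""
    v, r = pair
    return v ^ r


def ledr_encode(bits: List[int], start_phase: int = 0) -> List[Pair]:
    """Encode each bit in the current phase, then toggle the phase."""
    pairs: List[Pair] = []
    phase = start_phase
    for bit in bits:
        pairs.append((bit, bit ^ phase))
        phase ^= 1
    return pairs


def ledr_decode(pairs: List[Pair], last_phase: int = 1) -> List[int]:
    """Accept a value each time the observed phase differs from the last one seen."""
    bits: List[int] = []
    for pair in pairs:
        phase = ledr_phase(pair)
        if phase != last_phase:
            bits.append(pair[0])
            last_phase = phase
    return bits


if __name__ == "__main__":
    line = ledr_encode([0, 0, 1])
    print(line, [ledr_phase(p) for p in line])
    print(ledr_decode(line))
```

Under this convention exactly one rail changes per transfer, and every line state carries a data value, so no state is spent on a spacer.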



## 2.4 Power Gating

The power overhead of power gating circuitry is consumed by the sleep controller, the sleep signal distribution network, and the sleep transistors [1]. The fundamental challenge for any power gating technique is to ensure that the saved standby power outweighs the power overhead of the power gating. Power gating techniques are classified into two types:

- 1. Coarse-grain power gating.
- 2. Fine-grain power gating.

In coarse-grain power gating, a large number of lookup tables (LUTs) share a single sleep controller, so the area and power overheads of the sleep controller are comparatively small. However, if any LUT within a coarse-grain power-gated domain is active, none of the LUTs that share the same sleep transistor can be set to the sleep mode. In fine-grain power gating, each LUT has its own sleep transistor and related sleep controller, so when any LUT is inactive, it can be set to the sleep mode immediately. This results in a much lower standby power. However, because every LUT carries its own sleep controller, fine-grain power gating is generally assumed to be less efficient than coarse-grain power gating in terms of overhead, even though it has the potential to cut more of the standby power.

## **3** ARCHITECTURE

## 3.1 FPGAs Overview

In an asynchronous architecture, it is easy to detect whether an LUT is in use or not [2]. Figure 5 demonstrates the principle of activity detection using the asynchronous architecture.

When new data arrives at the Logic Block (LB), the phase of the input data differs from that of the output data. When the operation is complete, the phase of the input data is the same as that of the output data. Based on this fact, the activity of the LB is detected by comparing the phases of the input data and the output data. The activity information can be exploited to power off unused LBs and to wake them up [9]. Therefore, the proposed sleep controller compares the phases of the input and output data.
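The phase comparison described above can be expressed as a small behavioral sketch. It assumes LEDR (V, R) pairs with phase = V XOR R; `lb_is_active` and `sleep_signal` are hypothetical names, and the rule that the LB sleeps only when no input is pending is an assumed simplification of the sleep controller.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]  # LEDR (V, R) pair; assumed convention: phase = V XOR R


def phase(pair: Pair) -> int:
    v, r = pair
    return v ^ r


def lb_is_active(input_pair: Pair, output_pair: Pair) -> bool:
    """New data is pending while the input phase differs from the output phase."""
    return phase(input_pair) != phase(output_pair)


def sleep_signal(input_pairs: Iterable[Pair], output_pair: Pair) -> bool:
    """Hypothetical sleep rule: sleep only when no input has pending data."""
    return not any(lb_is_active(p, output_pair) for p in input_pairs)


if __name__ == "__main__":
    # Input in phase 1, output still in phase 0: the LB has work to do.
    print(lb_is_active((1, 0), (0, 0)))
    # Both inputs and the output in phase 0: the LB may be powered off.
    print(sleep_signal([(0, 0), (1, 1)], (1, 1)))
```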



The details of the LB are shown in Figure 6. The LB mainly consists of an LEDR to 4-phase converter, a function unit, a 4-phase to LEDR converter, a data arrival detector and a converter controller [3][4].



Figure 7 shows the block diagram of an LB. Each LB mainly consists of an LUT, an output register, a sleep controller, and a C-element. The LUT implements two-input, one-output logic functions. The C-element is a state-holding element [10]. The main elements of the sleep controller circuit that accompanies the logic block are the phase comparator and the programmable units.
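The state-holding behavior of a C-element can be modeled in a few lines. This is a behavioral sketch of a generic two-input Muller C-element, not the specific circuit used in the LB; the class name and initial state are assumptions.

```python
class CElement:
    """Two-input Muller C-element: the output changes only when both inputs agree."""

    def __init__(self, initial: int = 0) -> None:
        self.output = initial

    def update(self, a: int, b: int) -> int:
        # When the inputs agree, the output takes their value;
        # otherwise the previous output is held.
        if a == b:
            self.output = a
        return self.output


if __name__ == "__main__":
    c = CElement()
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(a, b, c.update(a, b))
```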



An LUT based on a hybrid of decoders and multiplexers is proposed in this paper. The hybrid LUT is used for power gating. Figure 8 shows the block diagram of the proposed LUT, which consists of four sub-modules. Each sub-module consists of a decoder, a multiplexer and a memory bit.
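A functional view of a two-input LUT with four memory bits is sketched below. It is an assumed behavioral model only: each sub-module is represented as a select line gating one memory bit, and the way the four results are merged is an illustrative choice, not the circuit of Figure 8. The names `decode_one_hot` and `lut2` are hypothetical.

```python
from typing import List


def decode_one_hot(a: int, b: int) -> List[int]:
    """2-to-4 decoder: exactly one select line is 1 for each input pair."""
    index = (a << 1) | b
    return [1 if i == index else 0 for i in range(4)]


def lut2(memory_bits: List[int], a: int, b: int) -> int:
    """Two-input, one-output LUT modeled as four (select AND memory bit) terms."""
    selects = decode_one_hot(a, b)
    out = 0
    for select, bit in zip(selects, memory_bits):
        out |= select & bit
    return out


if __name__ == "__main__":
    xor_config = [0, 1, 1, 0]  # memory bits for inputs 00, 01, 10, 11
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, lut2(xor_config, a, b))
```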



## 3.2 Circuit Implementation

The structure of the proposed converter controller is shown in Figure 9. The converter controller generates the phase of the LEDR encoding and a signal that the data arrival detector uses to generate the phase of the four-phase dual-rail encoding [11]. The converter controller comprises simple gates and flip-flops.



IJSER © 2013 http://www.ijser.org

The data arrival detector is shown in Figure 10. A Muller C-element, which holds states, is used in the data arrival detector [12].



The structure of the LEDR to 4-phase converter is shown in Figure 11. The converter generates two bits for each one-bit data value [13].



The proposed 4-phase to LEDR converter is shown in Figure 12. It also serves as an output register [14]. It consists of two modules: one generates Out.V and the other generates Out.R.
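The two conversions can be illustrated with a functional sketch. It ignores timing and handshaking entirely, and rests on two assumptions: the four-phase code words are (1, 0) for a 1 and (0, 1) for a 0, and the LEDR phase is Out.V XOR Out.R. The function names are hypothetical.

```python
from typing import Tuple

Pair = Tuple[int, int]

# Assumed four-phase code words: (1, 0) encodes a 1 and (0, 1) encodes a 0.
FOUR_PHASE = {0: (0, 1), 1: (1, 0)}


def ledr_to_four_phase(ledr: Pair) -> Pair:
    """Keep the value rail V and re-encode it as a two-bit four-phase code word."""
    v, _r = ledr
    return FOUR_PHASE[v]


def four_phase_to_ledr(code: Pair, phase: int) -> Pair:
    """Rebuild (Out.V, Out.R) for the requested phase; assumes phase = V XOR R."""
    decode = {pair: bit for bit, pair in FOUR_PHASE.items()}
    if code not in decode:
        raise ValueError("not a data code word (spacer or invalid state)")
    out_v = decode[code]
    out_r = out_v ^ phase
    return (out_v, out_r)


if __name__ == "__main__":
    code = ledr_to_four_phase((1, 0))
    print(code, four_phase_to_ledr(code, phase=1))
```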



The main feature of TM-LEDR signaling is the use of one dual-rail line to send two adjacent bits of a word sequentially, merging two 2-phase cycles into a 4-phase one [15]. Figure 13 shows the transmission of four-bit data.
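The time-multiplexing idea can be illustrated with a simplified model. This is not the full protocol of [15]: it only shows how adjacent bit pairs share one dual-rail line, and it assumes that the first bit of each pair is sent in LEDR phase 0 and the second in phase 1, with phase = V XOR R. The function names are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (V, R) on one dual-rail line


def tm_ledr_schedule(word: List[int]) -> List[List[Pair]]:
    """Map an even-length word onto len(word) // 2 dual-rail lines.

    Line i carries bits 2i and 2i+1 one after the other (assumed phases 0 then 1).
    """
    if len(word) % 2 != 0:
        raise ValueError("word length must be even")
    lines: List[List[Pair]] = []
    for i in range(0, len(word), 2):
        first, second = word[i], word[i + 1]
        lines.append([(first, first ^ 0), (second, second ^ 1)])
    return lines


def tm_ledr_recover(lines: List[List[Pair]]) -> List[int]:
    """Read back the value rail of every transfer, line by line."""
    return [v for line in lines for v, _r in line]


if __name__ == "__main__":
    schedule = tm_ledr_schedule([1, 0, 1, 1])
    print(schedule)
    print(tm_ledr_recover(schedule))
```

In this model a four-bit word needs two dual-rail lines instead of four, which is the source of the area saving that the time multiplexing aims at.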



## 4 RESULTS AND DISCUSSION

The proposed design was synthesized using Leonardo Spectrum LS 2009 a\_6, and the synthesis results obtained are given below. Estimates of VLSI parameters such as area and propagation delay were obtained for the temperature and clock frequency settings used.

| Parameter                  | Value   |
|----------------------------|---------|
| **Total accumulated area** |         |
| Number of Ports            | 13      |
| Number of Nets             | 11      |
| Number of Instances        | 8       |
| Number of Gates            | 26      |
| **Power consumption**      |         |
| Power                      | 7 mW    |
| **Propagation delay**      |         |
| Data Arrival Time          | 0.48 ns |

Fig. 14. Synthesis Results for a Logic Cell based on LEDR Encoding

| Parameter                  | Value   |
|----------------------------|---------|
| **Total accumulated area** |         |
| Number of Ports            | 11      |
| Number of Nets             | 11      |
| Number of Instances        | 6       |
| Number of Gates            | 23      |
| **Power consumption**      |         |
| Power                      | 9 mW    |
| **Propagation delay**      |         |
| Data Arrival Time          | 0.43 ns |

Fig. 15. Synthesis Results for a Logic Cell based on TM-LEDR Encoding
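The relative differences between the two sets of synthesis results can be computed directly from the reported figures. The sketch below uses only the values in Fig. 14 and Fig. 15; the dictionary keys and the helper name `percent_change` are illustrative.

```python
# Values copied from the synthesis results in Fig. 14 (LEDR) and Fig. 15 (TM-LEDR).
ledr = {"ports": 13, "nets": 11, "instances": 8, "gates": 26,
        "power_mw": 7.0, "delay_ns": 0.48}
tm_ledr = {"ports": 11, "nets": 11, "instances": 6, "gates": 23,
           "power_mw": 9.0, "delay_ns": 0.43}


def percent_change(old: float, new: float) -> float:
    """Signed change from old to new, in percent of old."""
    return (new - old) / old * 100.0


changes = {key: percent_change(ledr[key], tm_ledr[key]) for key in ledr}

if __name__ == "__main__":
    for key, value in changes.items():
        print(f"{key}: {value:+.1f}%")
```

The gate count and data arrival time are lower for the TM-LEDR cell, while the reported power figure is higher (9 mW against 7 mW), so the area and speed gains should be read alongside that power figure.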

## **5** CONCLUSION

When applying the proposed fine-grain power gating method to complex datapath architectures, four-phase dual-rail encoding can be most efficiently combined with LEDR encoding. Four-phase dual-rail encoding is best suited for an LUT because of its small area, while LEDR encoding is best employed for the switch block because of its high throughput and low power.

The TM-LEDR encoding architecture implemented along with power gating yields an area-efficient and high-speed design compared to the LEDR architecture.

## REFERENCES

- [1] Tuan T., Rahman A., Das S., Trimberger S., Kao S., "A 90-nm Low-Power FPGA for Battery-Powered Applications," in *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, Vol. 26, No. 2, 2007, pp. 296-300.
- [2] W.-K. Chen, *Linear Networks and Systems*. Belmont, Calif.: Wadsworth, pp. 123-135, 1993.
- [3] S. Ishihara, M. Hariyama, and M. Kameyama, "A low power FPGA based on autonomous fine-grain power gating," in *Proc. Asia South Pacific Des. Autom. Conf. (ASP-DAC), Yokohama, Japan, Jan. 2009, pp.* 119–120.
- [4] M. Hariyama, S. Ishihara and M. Kameyama, "A Low-Power Field-Programmable VLSI Based on a Fine-Grained Power-Gating Scheme," in *Circuits and Systems, 2008. MWSCAS 2008. 51st Midwest Symposium*, pp. 702-705.
- [5] Manohar R., "Reconfigurable Asynchronous Logic," in *Proc. IEEE Custom Integrated Circuits Conference, CICC 2006,* pp 13-21.
- [6] M. Hariyama, S. Ishihara, C. C. Wei, and M. Kameyama, "A field programmable VLSI based on an asynchronous bit-serial architecture," in Proc. IEEE Asian Solid-State Circuits Conf. (ASSCC), Jeju, Korea, Nov. 2007, pp. 380–383.
- [7] M. Hariyama, S. Ishihara, and M. Kameyama, "Evaluation of a field programmable VLSI based on an asynchronous bit-serial architecture," *IEICE Trans. Electron.*, vol. E91-C, no. 9, 2008, pp. 1419–1426.
- [8] M. E. Dean, T. E. Williams, and D. L. Dill, "Efficient self-timing with level-encoded 2-phase dual-rail (LEDR)," in *Proc. Univ. California/ Santa Cruz Conf. Adv. Res. VLSI*, 1991, pp. 55–70.
- [9] S. Ishihara, M. Hariyama, and M. Kameyama, "A low power FPGA based on autonomous fine-grain power gating," in *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, Aug. 2011, pp. 1394–1406.
- [10] Moreira M., Oliveira B., Moraes F., Calazans N., "Impact of C-elements on Asynchronous Circuits," in Proc. 13th International Symposium on Quality Electronic Design, ISQED 2012, pp. 437-443.
- [11] Y. Komatsu, S. Ishihara, M. Hariyama and M. Kameyama, "An Implementation of an Asynchronous FPGA Based on LEDR/Four-Phase-Dual-Rail Hybrid Architecture," in 16th Asia and South Pacific Design Automation Conference (ASP-DAC), 2011, pp. 89–90.
- [12] Murphy J.P., "Design of Latch-based C-elements," in *Electronics Letters*, Vol.48, No.19,2012, pp 1190-1191.
- [13] K. Naveena and N. Kirthika, "A Low Power Asynchronous FPGA With Power Gating and Dual Rail Encoding," *IJCSET March* 2012, *Vol. 2, Issue 3*, pp 949-952.
- [14] Pedram M., Qing Wu, Xunwei Wu, "A New Design for Double Edge Triggered Flip-flops," in Proc. Asia and South Pacific Design Automation Conference, ASP-DAC 1998, pp 417-421.

- [15] Marco Storto and Roberto Saletti, "Time-Multiplexed Dual-Rail Protocol for Low-Power Delay-insensitive Asynchronous Communication," in Proc. University of Pisa, Italy Conf. Adv. Res. VLSI, 1997.